{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hackathon 7\n",
    "\n",
    "Written by Eleanor Quint\n",
    "with images sourced from [Isaac Dykeman's blog post on CVAE](http://ijdykeman.github.io/ml/2016/12/21/cvae.html).\n",
    "\n",
    "Topics:\n",
    "- Stochastic Computation Graphs\n",
    "- Variational Autoencoder\n",
    "    - Reparameterization trick\n",
    "- Visualization with t-sne\n",
    "\n",
    "This is all setup in a IPython notebook so you can run any code you want to experiment with. Feel free to edit any cell, or add some to run your own code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# We'll start with our library imports...\n",
    "from __future__ import print_function\n",
    "\n",
    "import os  # to work with file paths\n",
    "\n",
    "import tensorflow as tf         # to specify and run computation graphs\n",
    "import numpy as np              # for numerical operations taking place outside of the TF graph\n",
    "import matplotlib.pyplot as plt # to draw plots\n",
    "\n",
    "mnist_dir = '/work/cse496dl/shared/hackathon/03/mnist/'\n",
    "cifar_dir = '/work/cse496dl/shared/hackathon/05/'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# extract our dataset, MNIST\n",
    "train_data = np.load(mnist_dir + 'mnist_train_images.npy')\n",
    "train_data = np.reshape(train_data, [-1, 28, 28, 1])\n",
    "test_data = np.load(mnist_dir + 'mnist_test_images.npy')\n",
    "test_data = np.reshape(test_data, [-1, 28, 28, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Stochastic Computation Graphs\n",
    "\n",
    "Up to this point, we've been working entirely with deterministic transformations of data between high dimensional vector spaces. That means we assign one code to one image. We can also transform the same image vector into a 10 dimensional distribution over classes in a classifier as in classification. As hinted at by all of the statistical terms we've been using (cross-entropy, distribution over classes, etc.), the theory underlying machine learning and deep learning is based on statistics.\n",
    "\n",
    "For example, we think of our dataset as an [iid](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables) sample of a distribution of interest (e.g., natural images, spectrograms of human voices, or EEGs taken while looking at cute puppies). Then, in classification, we transform each sample of that data distribution using our parametric network into a sample of an associated distribution over classes. Thus, we can think of our networks as functions that [transform](http://math.bme.hu/~nandori/Virtual_lab/stat/dist/Transformations.pdf) one random variable into another or parameterize conditional distributions like a [Bayesian network](https://en.wikipedia.org/wiki/Bayesian_network#Inference_and_learning).\n",
    "\n",
    "Once we take this perspective of deep learning as learning parametric functions that transform random variables, we can think about using random variables in the middle of the graph for many reasons. We'll look at replacing the static codes of the autoencoders we studied last week with a random variable, creating the [variational autoencoder](https://arxiv.org/abs/1606.05908) (VAE), which allows us to generate data by sampling codes and transforming them into data.\n",
    "\n",
    "\n",
    "#### Variational Autoencoder\n",
    "\n",
    "Generally autoencoders associate individual latent codes with each datum. This gives us a powerful ability to transform small codes into full-sized data. Instead of individual codes, the VAE uses distributions over codes (typically Gaussian). Let's specify a stochastic encoder that, rather than a single code, outputs the parameters of the posterior distribution, a Gaussian (Normal) distribution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def conv_block(inputs, filters, downscale=1):\n",
    "    \"\"\"\n",
    "    Args:\n",
    "        - inputs: 4D tensor of shape NHWC\n",
    "        - filters: int\n",
    "    \"\"\"\n",
    "    with tf.name_scope('conv_block') as scope:\n",
    "        conv = tf.layers.conv2d(inputs, filters, 3, 1, padding='same')\n",
    "        down_conv = tf.layers.conv2d(conv, filters, 3, strides=downscale, padding='same')\n",
    "        return down_conv\n",
    "\n",
    "def gaussian_encoder(inputs, latent_size):\n",
    "    \"\"\"inputs should be a tensor of images whose height and width are multiples of 4\"\"\"\n",
    "    x = conv_block(inputs, 8, downscale=2)\n",
    "    x = conv_block(x, 16, downscale=2)\n",
    "    mean = tf.layers.dense(x, latent_size)\n",
    "    log_scale = tf.layers.dense(x, latent_size)\n",
    "    return mean, log_scale"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This allows us to encode input images to certain regions of latent space. <img src=\"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR7ICq0ljD0bX-_pusVuPeof-7rMunTpgBoSmL25cgo12AFGF8H\" width=\"70%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### The Reparameterization Trick\n",
    "\n",
    "The real brilliance behind the VAE is the idea to reparameterize the latent variable. Rather than using the outputs of the previous layer directly as the parameters of a Gaussian, we'll instead compute a reparameterized formula that is equivalent, but has lower variance gradients:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def gaussian_sample(mean, log_scale):\n",
    "    # noise is zero centered and std. dev. 1\n",
    "    gaussian_noise = tf.random_normal(shape=tf.shape(mean))\n",
    "    return mean + (tf.exp(log_scale) * gaussian_noise)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we'll specify a deterministic decoder that transforms from the latent variable back to the data variable, just as in any other autoencoder:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from operator import mul\n",
    "from functools import reduce\n",
    "\n",
    "def upscale_block(x, scale=2, name='upscale_block'):\n",
    "    \"\"\"[Sub-Pixel Convolution](https://arxiv.org/abs/1609.05158) \"\"\"\n",
    "    n, w, h, c = x.get_shape().as_list()\n",
    "    x = tf.layers.conv2d(x, c * scale ** 2, (3, 3), activation=tf.nn.relu, padding='same', name=name)\n",
    "    output = tf.depth_to_space(x, scale)\n",
    "    return output\n",
    "\n",
    "def decoder(inputs, output_shape):\n",
    "    \"\"\"output_shape should be a length 3 iterable of ints\"\"\"\n",
    "    h, w, c = output_shape\n",
    "    initial_shape = [h // 4, w // 4, c]\n",
    "    initial_size = reduce(mul, initial_shape)\n",
    "    x = tf.layers.dense(inputs, initial_size // 32, name='decoder_dense')\n",
    "    x = tf.reshape(x, [-1] + initial_shape)\n",
    "    x = upscale_block(x, name='upscale1')\n",
    "    return upscale_block(x, name='upscale2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This allows us to specify the decoder which will produce good digits from codes sampled from the unit normal distribution. <img src=\"http://ijdykeman.github.io/assets/cvae_figures/vae_decoder_diagram.svg\" width=\"70%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### VAE loss\n",
    "\n",
    "The VAE loss is the sum of two components: latent loss and reconstruction loss.\n",
    "\n",
    "The latent loss penalizes each [posterior distribution](https://en.wikipedia.org/wiki/Posterior_probability) for its KL-divergence from the [prior distribution](https://en.wikipedia.org/wiki/Prior_probability). This pushes the model to assign codes so that we can sample from the prior distribution and transform it into the data distribution, and regularizes the model (note that we use no other regularization in the model below). We've chosen the standard normal (Gaussian) distribution, `N(0,1)` to be the prior, so we use `std_gaussian_KL_divergence` to calculate the KL divergence.\n",
    "\n",
    "<img src=\"http://ijdykeman.github.io/assets/cvae_figures/kl_divergence_diagram.svg\" width=\"80%\">\n",
    "\n",
    "The reconstruction loss is a probabilistic approach to calculating the distance between two images. Instead of interpreting the output of the decoder as a static image, we interpret it as the mean parameter of a distribution and then calculate the log-likelihood of the data under that distribution. I've provided `bernoulli_logp` and `discretized_logistic_logp` as two basic options, but other distributions can be used as well. [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution) seems like a natural choice for greyscale images with its single parameter equal to the decoder output, but [discretized logistic](https://en.wikipedia.org/wiki/Logistic_distribution) has been used by some powerful models lately. It uses the decoder output as the distribution mean and learns the scale as a separate parameter over the entire dataset. Note that these distributions are usually only used in the calculation of the reconstruction loss, with the decoder output taken as the sampled image.\n",
    "\n",
    "Finally, the provided `vae_loss` function provides a neat interface to all this calculation and returns a tuple of the losses, which can be summed to get the total VAE loss."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define an epsilon\n",
    "EPS = 1e-10\n",
    "\n",
    "def std_gaussian_KL_divergence(mu, log_sigma):\n",
    "    \"\"\"Analytic KL distance between N(mu, e^log_sigma) and N(0, 1)\"\"\"\n",
    "    sigma = tf.exp(log_sigma)\n",
    "    return -0.5 * tf.reduce_sum(\n",
    "        1 + tf.log(tf.square(sigma)) - tf.square(mu) - tf.square(sigma), 1)\n",
    "\n",
    "\n",
    "def flatten(inputs):\n",
    "    \"\"\"\n",
    "    Flattens a tensor along all non-batch dimensions.\n",
    "    This is correctly a NOP if the input is already flat.\n",
    "    \"\"\"\n",
    "    if len(shape(inputs)) == 2:\n",
    "        return inputs\n",
    "    else:\n",
    "        size = inputs.get_shape().as_list()[1:]\n",
    "        return tf.reshape(inputs, [-1, size])\n",
    "\n",
    "def bernoulli_logp(alpha, sample):\n",
    "    \"\"\"Calculates log prob of sample under bernoulli distribution.\n",
    "    \n",
    "    Note: args must be in range [0,1]\n",
    "    \"\"\"\n",
    "    alpha = flatten(alpha)\n",
    "    sample = flatten(sample)\n",
    "    return tf.reduce_sum(sample * tf.log(EPS + alpha) +\n",
    "                         ((1 - sample) * tf.log(EPS + 1 - alpha)), 1)\n",
    "\n",
    "def discretized_logistic_logp(mean, logscale, sample, binsize=1 / 256.0):\n",
    "    \"\"\"Calculates log prob of sample under discretized logistic distribution.\"\"\"\n",
    "    scale = tf.exp(logscale)\n",
    "    sample = (tf.floor(sample / binsize) * binsize - mean) / scale\n",
    "    logp = tf.log(\n",
    "        tf.sigmoid(sample + binsize / scale) - tf.sigmoid(sample) + EPS)\n",
    "\n",
    "    if logp.shape.ndims == 4:\n",
    "        logp = tf.reduce_sum(logp, [1, 2, 3])\n",
    "    elif logp.shape.ndims == 2:\n",
    "        logp = tf.reduce_sum(logp, 1)\n",
    "    return logp\n",
    "\n",
    "def vae_loss(inputs, outputs, latent_mean, latent_log_scale, output_dist, output_log_scale=None):\n",
    "    \"\"\"Calculate the VAE loss (aka [ELBO](https://arxiv.org/abs/1312.6114))\n",
    "    \n",
    "    Args:\n",
    "        - inputs: VAE input\n",
    "        - outputs: VAE output\n",
    "        - latent_mean: parameter of latent distribution\n",
    "        - latent_log_scale: log of std. dev. of the latent distribution\n",
    "        - output_dist: distribution parameterized by VAE output, must be in ['logistic', 'bernoulli']\n",
    "        - output_log_scale: log scale parameter of the output dist if it's logistic, can be learnable\n",
    "        \n",
    "    Note: output_log_scale must be specified if output_dist is logistic\n",
    "    \"\"\"\n",
    "    # Calculate reconstruction loss\n",
    "    # Equal to minus the log likelihood of the input data under the VAE's output distribution\n",
    "    if output_dist == 'bernoulli':\n",
    "        outputs = tf.sigmoid(outputs)\n",
    "        reconstruction_loss = -bernoulli_logp(outputs, inputs)\n",
    "    elif output_dist == 'logistic':\n",
    "        outputs = tf.clip_by_value(outputs, 1 / 512., 1 - 1 / 512.)\n",
    "        reconstruction_loss = -discretized_logistic_logp(outputs, output_log_scale, inputs)\n",
    "    else:\n",
    "        print('Must specify an argument for output_dist in [bernoulli, logistic]')\n",
    "    reconstruction_loss = tf.reduce_mean(reconstruction_loss)\n",
    "        \n",
    "    # Calculate latent loss\n",
    "    latent_loss = std_gaussian_KL_divergence(latent_mean, latent_log_scale)\n",
    "    latent_loss = tf.reduce_mean(latent_loss)\n",
    "    \n",
    "    return reconstruction_loss, latent_loss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's specify the full VAE using the logistic output distribution: <img src=\"http://ijdykeman.github.io/assets/cvae_figures/vae_diagram.svg\" width=\"80%\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "latent_size = 10\n",
    "img_shape = [28, 28, 1]\n",
    "\n",
    "tf.reset_default_graph()\n",
    "inputs = tf.placeholder(tf.float32, shape=[None] + img_shape)\n",
    "\n",
    "# VAE\n",
    "with tf.variable_scope(\"model\") as scope:\n",
    "    means, log_scales = gaussian_encoder(inputs, latent_size)\n",
    "    codes = gaussian_sample(means, log_scales)\n",
    "    outputs = decoder(codes, img_shape)\n",
    "with tf.variable_scope(\"model\", reuse=True) as scope:\n",
    "    gen_sample = decoder(codes, img_shape)\n",
    "\n",
    "# calculate loss with learnable parameter for output log_scale\n",
    "output_log_scale = tf.get_variable(\"output_log_scale\", initializer=tf.constant(0.0, shape=img_shape))\n",
    "reconstruction_loss, latent_loss = vae_loss(inputs, outputs, means, log_scales, 'logistic', output_log_scale)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And train the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# setup optimizer\n",
    "total_loss = reconstruction_loss + latent_loss\n",
    "optimizer = tf.train.AdamOptimizer()\n",
    "train_op = optimizer.minimize(total_loss)\n",
    "\n",
    "# train for an epoch and visualize\n",
    "batch_size = 16\n",
    "session = tf.Session()\n",
    "session.run(tf.global_variables_initializer())\n",
    "for epoch in range(5):\n",
    "    for i in range(train_data.shape[0] // batch_size):\n",
    "        batch_xs = train_data[i*batch_size:(i+1)*batch_size, :]\n",
    "        session.run(train_op, {inputs: batch_xs})\n",
    "print(\"Done!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we'll run a test to check out how it performs. First we encode an input image:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# run a test\n",
    "idx = np.random.randint(test_data.shape[0])\n",
    "inputs_data = np.repeat(np.expand_dims(test_data[idx], axis=0), 3, axis=0)\n",
    "inputs_out, output_log_scale, outputs_out = session.run([inputs, output_log_scale, outputs], {inputs: inputs_data})\n",
    "print(\"Input image\")\n",
    "plt.imshow(np.squeeze(inputs_out[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we'll sample from the posterior distribution a few times and visualize the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Show reconstructions\n",
    "print(\"Reconstruction\")\n",
    "fig=plt.figure(figsize=(10, 10))\n",
    "columns = 3\n",
    "rows = 3\n",
    "for i in range(rows):\n",
    "    for j in range(columns):\n",
    "        if i == j:\n",
    "            img = np.squeeze(outputs_out[i])\n",
    "        else:\n",
    "            img = np.squeeze(outputs_out[i]) - np.squeeze(outputs_out[j])\n",
    "        fig.add_subplot(columns, rows, (i*rows) + j + 1)\n",
    "        plt.imshow(img)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The learned standard deviation of the logistic output distribution is interesting to visualize. Essentially, it is the learned uncertainty over the dataset per pixel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(\"Output distribution standard deviation\")\n",
    "plt.imshow(np.squeeze(np.exp(output_log_scale)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Hackathon 7 Exercise\n",
    "\n",
    "Write code that samples from the VAE by sampling from the prior distribution, decoding with the above learned decoder, and visualizing the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### t-SNE visualization\n",
    "\n",
    "(from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html))\n",
    "\n",
    "t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.\n",
    "\n",
    "t-SNE will be initialized with the embedding that is generated by PCA in this example, which is not the default setting. It ensures global stability of the embedding, i.e., the embedding does not depend on random initialization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from time import time\n",
    "\n",
    "import sklearn\n",
    "\n",
    "flat_train_data = np.reshape(train_data, [-1, 784])\n",
    "print(\"Computing t-SNE embedding\")\n",
    "tsne = sklearn.manifold.TSNE(n_components=2, init='pca')\n",
    "t0 = time()\n",
    "X_tsne = tsne.fit_transform(flat_train_data)\n",
    "\n",
    "plot_embedding(X_tsne,\n",
    "               \"t-SNE embedding of the digits (time %.2fs)\" %\n",
    "               (time() - t0))\n",
    "\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "TensorFlow 1.12 (py36)",
   "language": "python",
   "name": "tensorflow-1.12-py36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
